Lesson 7. Attribute and Spatial Joins

Now that we understand the logic of spatial relationship queries, let’s take a look at another fundamental spatial operation that relies on them.

This operation, called a spatial join, is the process by which we can leverage the spatial relationships between distinct datasets to merge their information into a new output dataset.

This operation can be thought as the spatial equivalent of an attribute join, in which multiple tabular datasets can be merged by aligning matching values in a common column that they both contain. Thus, we’ll start by developing an understanding of this operation first!


Instructor Notes

library(sf)
library(tmap)
library(here)

7.0 Data Input and Prep

Let’s read in a table of data from the US Census’ 5-year American Community Survey (ACS5).

# Read in the ACS5 data for CA into an `sf` object.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5_df <- read.csv(here("notebook_data",
                         "census",
                         "ACS5yr",
                         "census_variables_CA_2018.csv"))

head(acs5_df)
##                                            NAME c_race c_white c_black c_asian
## 1  Census Tract 8.02, Merced County, California   3996    1609      50     231
## 2  Census Tract 9.01, Merced County, California   3836    1402      97      34
## 3 Census Tract 15.02, Merced County, California   2493     158      18     124
## 4  Census Tract 9.02, Merced County, California   9811    3752      87    1358
## 5    Census Tract 12, Merced County, California   5431    2187     137     358
## 6 Census Tract 14.01, Merced County, California   4503    1586      60     250
##   c_latinx c_race_moe c_white_moe c_black_moe c_asian_moe c_latinx_moe
## 1     2082        323         249          36         103          331
## 2     2220        495         186          46          25          480
## 3     2154        227         105          22          57          250
## 4     4172        796         863          83         621          536
## 5     2388        450         266         104         140          412
## 6     2415        470         358          44         207          353
##   state_fips county_fips tract_fips med_rent med_hhinc c_tenants c_owners
## 1          6          47        802     1215     59153      1331      911
## 2          6          47        901      772     57462       931      467
## 3          6          47       1502      679     31346       728      225
## 4          6          47        902     1026     59781      2684     1768
## 5          6          47       1200     1190     59984      1837     1125
## 6          6          47       1401      908     39879      1482      553
##   c_renters med_rent_moe med_hhinc_moe c_tenants_moe c_owners_moe c_renters_moe
## 1       420          234         10395            91          137           123
## 2       464           85          9764            86          104           109
## 3       503           37          3630            53           71            85
## 4       916          141         10238           149          235           233
## 5       712          133          5282            99          124           122
## 6       929           60          4669            69          100           110
##   c_movers c_stay c_movelocal c_movecounty c_movestate c_moveabroad
## 1     3943   3351         578           14           0            0
## 2     3783   3134         564           74           6            5
## 3     2469   2108         259           45          24           33
## 4     9762   8700         934           42           0           86
## 5     5372   4943         316           29          58           26
## 6     4408   3429         605          140         212           22
##   c_movers_moe c_stay_moe c_movelocal_moe c_movecounty_moe c_movestate_moe
## 1          320        371             296               16              12
## 2          491        470             139               41               8
## 3          225        277             154               63              41
## 4          797        864             606               50              17
## 5          437        470             210               30              64
## 6          474        448             324              112             189
##   c_moveabroad_moe c_commute c_car c_carpool c_transit c_bike c_walk
## 1               12      1662  1370       203        21      0      0
## 2                9      1325  1044        89         0      0    128
## 3               40       839   541        79         7     10     48
## 4               96      3685  3348       162         0      0     28
## 5               36      2257  1893       145         1      0     28
## 6               48      1603  1238       100        29      0     65
##   c_commute_moe c_car_moe c_carpool_moe c_transit_moe c_bike_moe c_walk_moe
## 1           231       199           135            23         12         12
## 2           279       280            61            12         12         59
## 3           160       125            50            12         18         60
## 4           378       406           120            17         17         46
## 5           260       247            78             5         17         32
## 6           250       241            62            38         12         49
##   year FIPS_11_digit    p_white     p_black     p_asian  p_latinx  p_owners
## 1 2018    6047000802 0.40265265 0.012512513 0.057807808 0.5210210 0.6844478
## 2 2018    6047000901 0.36548488 0.025286757 0.008863399 0.5787278 0.5016112
## 3 2018    6047001502 0.06337746 0.007220217 0.049739270 0.8640193 0.3090659
## 4 2018    6047000902 0.38242789 0.008867598 0.138416064 0.4252370 0.6587183
## 5 2018    6047001200 0.40268827 0.025225557 0.065917879 0.4396980 0.6124115
## 6 2018    6047001401 0.35220964 0.013324450 0.055518543 0.5363091 0.3731444
##   p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.3155522 0.8498605  0.14658889  0.003550596 0.000000000  0.000000000
## 2 0.4983888 0.8284430  0.14908803  0.019561195 0.001586043  0.001321702
## 3 0.6909341 0.8537870  0.10490077  0.018226002 0.009720535  0.013365735
## 4 0.3412817 0.8912108  0.09567712  0.004302397 0.000000000  0.008809670
## 5 0.3875885 0.9201415  0.05882353  0.005398362 0.010796724  0.004839911
## 6 0.6268556 0.7779038  0.13725045  0.031760436 0.048094374  0.004990926
##       p_car  p_carpool   p_transit     p_bike      p_walk
## 1 0.8243081 0.12214200 0.012635379 0.00000000 0.000000000
## 2 0.7879245 0.06716981 0.000000000 0.00000000 0.096603774
## 3 0.6448153 0.09415971 0.008343266 0.01191895 0.057210965
## 4 0.9085482 0.04396201 0.000000000 0.00000000 0.007598372
## 5 0.8387240 0.06424457 0.000443066 0.00000000 0.012405848
## 6 0.7723019 0.06238303 0.018091079 0.00000000 0.040548971

Brief summary of the data:

Below is a table of the variables in this table. They were combined from different ACS 5 year tables.

NOTE:

  • variables that start with c_ are counts
  • variables that start with med_ are medians
  • variables that end in _moe are margin of error estimates
  • variables that start with p_ are proportions calcuated from the counts divided by the table denominator (the total count for whom that variable was assessed)
Variable Description
c_race Total population
c_white Total white non-Latinx
c_black Total black and African American non-Latinx
c_asian Total Asian non-Latinx
c_latinx Total Latinx
state_fips State level FIPS code
county_fips County level FIPS code
tract_fips Tract level FIPS code
med_rent Median rent
med_hhinc Median household income
c_tenants Total tenants
c_owners Total owners
c_renters Total renters
c_movers Total number of people who moved
c_stay Total number of people who stayed
c_movelocal Number of people who moved locally
c_movecounty Number of people who moved counties
c_movestate Number of people who moved states
c_moveabroad Number of people who moved abroad
c_commute Total number of commuters
c_car Number of commuters who use a car
c_carpool Number of commuters who carpool
c_transit Number of commuters who use public transit
c_bike Number of commuters who bike
c_walk Number of commuters who bike
year ACS data year
FIPS_11_digit 11-digit FIPS code
…. and more

We’re going to drop all of our moe columns by identifying all of those that end with _moe. We can do that in two steps, first by using filter to identify columns that contain the string _moe.

tidyverse will help with this!

library(tidyverse) 
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
acs5_df <- acs5_df %>% select(-contains("_moe"))

Unfortunately, when this dataset reads in, the 11-digit FIPS codes that should be strings actually read in as numerics, and thus the leading 0 gets truncated. We’re going to need those FIPS code in the correct format later, so let’s reformat them now.

# recast the FIPS 11-digit codes as strings, pasting a 0 at the front of each
acs5_df$FIPS_11_digit <- paste0('0', acs5_df$FIPS_11_digit)

And lastly, let’s grab only the rows for year 2018 and county FIPS code 1 (i.e., Alameda County)

acs5_df_ac <- acs5_df[acs5_df$year == 2018 & acs5_df$county_fips == 1, ]

Now, take another look at the dataframe

head(acs5_df_ac)
##                                                 NAME c_race c_white c_black
## 267 Census Tract 4415.01, Alameda County, California   6570     677     111
## 268    Census Tract 4047, Alameda County, California   2079    1515     134
## 269    Census Tract 4425, Alameda County, California   7748    1430     375
## 270    Census Tract 4503, Alameda County, California   5301    2597      96
## 271 Census Tract 4506.07, Alameda County, California   5971    2832     324
## 272    Census Tract 4049, Alameda County, California   3973    2383     394
##     c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 267    4740      570          6           1     441501     1883    152394
## 268     199      175          6           1     404700     3250    181000
## 269    3379     1904          6           1     442500     1999     95323
## 270    1077     1315          6           1     450300     2626    121131
## 271    1726      804          6           1     450607     1837     96061
## 272     442      337          6           1     404900     1446    111846
##     c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 267      1796     1634       162     6491   6010         257           68
## 268       790      680       110     2043   1822          58           77
## 269      2264     1245      1019     7628   6858         557           29
## 270      1814     1249       565     5249   4668         134          326
## 271      2445      883      1562     5905   4735         715          254
## 272      1874     1095       779     3939   3647         123          100
##     c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 267         129           27      2317  1765       264       127     28      8
## 268          65           21      1075   572       191       170      7      6
## 269          23          161      3330  2382       375       214     59     34
## 270         114            7      2819  2112       183       265     28     33
## 271         201            0      3315  2316       440       193     71    114
## 272          34           35      2211  1030       303       522     48      0
##     year FIPS_11_digit   p_white    p_black   p_asian   p_latinx  p_owners
## 267 2018   06001441501 0.1030441 0.01689498 0.7214612 0.08675799 0.9097996
## 268 2018   06001404700 0.7287157 0.06445406 0.0957191 0.08417508 0.8607595
## 269 2018   06001442500 0.1845638 0.04839959 0.4361125 0.24574084 0.5499117
## 270 2018   06001450300 0.4899076 0.01810979 0.2031692 0.24806640 0.6885336
## 271 2018   06001450607 0.4742924 0.05426227 0.2890638 0.13465081 0.3611452
## 272 2018   06001404900 0.5997986 0.09916939 0.1112509 0.08482255 0.5843116
##      p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 267 0.09020045 0.9258974  0.03959328  0.010476044 0.019873671  0.004159606
## 268 0.13924051 0.8918257  0.02838962  0.037689672 0.031815957  0.010279001
## 269 0.45008834 0.8990561  0.07302045  0.003801783 0.003015207  0.021106450
## 270 0.31146637 0.8893122  0.02552867  0.062107068 0.021718423  0.001333587
## 271 0.63885481 0.8018628  0.12108383  0.043014395 0.034038950  0.000000000
## 272 0.41568837 0.9258695  0.03122620  0.025387154 0.008631632  0.008885504
##         p_car  p_carpool  p_transit      p_bike      p_walk
## 267 0.7617609 0.11394044 0.05481226 0.012084592 0.003452741
## 268 0.5320930 0.17767442 0.15813953 0.006511628 0.005581395
## 269 0.7153153 0.11261261 0.06426426 0.017717718 0.010210210
## 270 0.7492018 0.06491664 0.09400497 0.009932600 0.011706279
## 271 0.6986425 0.13273002 0.05822021 0.021417798 0.034389140
## 272 0.4658526 0.13704206 0.23609227 0.021709634 0.000000000

Now let’s also read in our census tracts again!

tracts_sf = st_read(here("notebook_data",
                         "census",
                         "Tracts",
                         "cb_2018_06_tract_500k.shp"))
## Reading layer `cb_2018_06_tract_500k' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\census\Tracts\cb_2018_06_tract_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 8041 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  NAD83
head(tracts_sf)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.5009 ymin: 37.92535 xmax: -120.3345 ymax: 39.30875
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 1      06      009  000300 1400000US06009000300 06009000300       3   CT
## 2      06      011  000300 1400000US06011000300 06011000300       3   CT
## 3      06      013  303102 1400000US06013303102 06013303102 3031.02   CT
## 4      06      013  303202 1400000US06013303202 06013303202 3032.02   CT
## 5      06      013  303203 1400000US06013303203 06013303203 3032.03   CT
## 6      06      013  307102 1400000US06013307102 06013307102 3071.02   CT
##       ALAND AWATER                       geometry
## 1 457009794 394122 MULTIPOLYGON (((-120.764 38...
## 2 952744514 195376 MULTIPOLYGON (((-122.5001 3...
## 3   6507019      0 MULTIPOLYGON (((-121.7294 3...
## 4   3725528      0 MULTIPOLYGON (((-121.7235 3...
## 5   6354210      0 MULTIPOLYGON (((-121.7449 3...
## 6   1603153      0 MULTIPOLYGON (((-121.8226 3...
tracts_sf_ac = tracts_sf[tracts_sf$COUNTYFP == '001',]
plot(tracts_sf_ac$geometry)

7.1 Attribute Joins

Attribute Joins between sf data.frames and plain data.frames

We just mapped the census tracts. But what makes a map powerful is when you map the data associated with the locations.

  • tracts_sf_ac: These are polygon data in this sf data.frame. However, as we saw with the head command, it contains no attributes of interest!

  • acs5_df_ac: These are 2018 ACS data attributes for CA census tracts read in from a CSV file (‘census_variables_CA_2018.csv’), into a regular data.frame. However, this data has no geometry column!

In order to map the ACS data we need to associate it with the tracts. We can do that by joining the columns from acs5_df_ac to the columns of tracts_gdf_ac using a common column as the key for matching rows. This process is called an attribute join.

Question

The image above gives us a nice conceptual summary of the types of joins we could run.

  1. In general, why might we choose one type of join over another?
  2. In our case, do we want an inner, left, right, or outer (AKA ‘full’) join?

(NOTE: You can read more about merging sf and plain data.frames here.)

Okay, here we go!

Let’s take a look at the common column in both our data.frames.

head(tracts_sf_ac['GEOID'])
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.3159 ymin: 37.66497 xmax: -122.1062 ymax: 37.84792
## Geodetic CRS:  NAD83
##          GEOID                       geometry
## 27 06001425101 MULTIPOLYGON (((-122.3142 3...
## 28 06001428600 MULTIPOLYGON (((-122.2799 3...
## 29 06001432600 MULTIPOLYGON (((-122.1675 3...
## 30 06001433200 MULTIPOLYGON (((-122.1667 3...
## 31 06001433900 MULTIPOLYGON (((-122.1209 3...
## 32 06001436100 MULTIPOLYGON (((-122.1329 3...
head(acs5_df_ac['FIPS_11_digit'])
##     FIPS_11_digit
## 267   06001441501
## 268   06001404700
## 269   06001442500
## 270   06001450300
## 271   06001450607
## 272   06001404900

Note that they are not named the same thing.

    That's okay! We just need to know that they contain the same information.

Also note that they are not in the same order.

    That's not only okay... That's the point! (If they were in the same order already then we could just join them side by side, without having R find and line up the matching rows from each!)

Let’s do a left join to keep all of the census tracts in Alameda County and only the ACS data for those tracts.

NOTE: To figure out how to do this we could always take a peek at the documentation by calling ?base::merge.

?base::merge
## starting httpd help server ... done
# left join keeps all tracts and the acs data for those tracts
tracts_acs_sf_ac <- base::merge(tracts_sf_ac, 
                                acs5_df_ac, 
                                by.x = 'GEOID', 
                                by.y = "FIPS_11_digit", 
                                all.x = TRUE)

# class of the  output data object
class(tracts_acs_sf_ac)
## [1] "sf"         "data.frame"

What is the class of the output if you join the tracts to the ACS data (i.e., reverse the order of the inputs)?

acs_and_tracts_ac <- base::merge(acs5_df_ac, 
                                 tracts_sf_ac, 
                                 by.y = 'GEOID', 
                                 by.x = "FIPS_11_digit", 
                                 all.x = TRUE)

class(acs_and_tracts_ac)
## [1] "data.frame"

ORDER MATTERS with base::merge

  • If you use base::merge to join a dataframe to an spatial dataframe the output is a spatial df.

  • If you use base::merge to join a spatial dataframe to a dataframe the output is as dataframe!

Take a look at the output

# take a look at the output
head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...

Let’s check that we have all the variables we have in our dataset now.

colnames(tracts_acs_sf_ac)
##  [1] "GEOID"        "STATEFP"      "COUNTYFP"     "TRACTCE"      "AFFGEOID"    
##  [6] "NAME.x"       "LSAD"         "ALAND"        "AWATER"       "NAME.y"      
## [11] "c_race"       "c_white"      "c_black"      "c_asian"      "c_latinx"    
## [16] "state_fips"   "county_fips"  "tract_fips"   "med_rent"     "med_hhinc"   
## [21] "c_tenants"    "c_owners"     "c_renters"    "c_movers"     "c_stay"      
## [26] "c_movelocal"  "c_movecounty" "c_movestate"  "c_moveabroad" "c_commute"   
## [31] "c_car"        "c_carpool"    "c_transit"    "c_bike"       "c_walk"      
## [36] "year"         "p_white"      "p_black"      "p_asian"      "p_latinx"    
## [41] "p_owners"     "p_renters"    "p_stay"       "p_movelocal"  "p_movecounty"
## [46] "p_movestate"  "p_moveabroad" "p_car"        "p_carpool"    "p_transit"   
## [51] "p_bike"       "p_walk"       "geometry"

Question

It’s always important to run sanity checks on our results, at each step of the way!

In this case, how many rows and columns should we have?

print("Rows and columns in the Alameda County Census tract spatial dataframe:")
## [1] "Rows and columns in the Alameda County Census tract spatial dataframe:"
print(dim(tracts_sf_ac))
## [1] 360  10
print("Row and columns in the ACS5 2018 data for Alameda County:")
## [1] "Row and columns in the ACS5 2018 data for Alameda County:"
print(dim(acs5_df_ac))
## [1] 361  44
print("Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:")
## [1] "Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:"
print(dim(tracts_acs_sf_ac))
## [1] 360  53

Let’s save out our merged data so we can use it in the final notebook.

st_write(tracts_acs_sf_ac, 
         here("outdata", "tracts_acs_sdf_ac.json"), 
         driver = 'GeoJSON', 
         delete_dsn = T)
## Deleting source `C:/Users/kathe/berkeley_drive/analyses/dlab/Geospatial-Fundamentals-in-R-with-sf/outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing layer `tracts_acs_sdf_ac' to data source 
##   `C:/Users/kathe/berkeley_drive/analyses/dlab/Geospatial-Fundamentals-in-R-with-sf/outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing 360 features with 52 fields and geometry type Multi Polygon.

Exercise: Choropleth Map

We can now make choropleth maps using our attribute joined geodataframe. Go ahead and pick one variable to color the map, then map it using tmap (since it’s too easy using the plot method). You can go back to lesson 5 if you need a refresher on how to make this!

To see the solution, look at the hidden text below.

head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
# YOUR CODE HERE

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 07_Joins_and_Aggregations.Rmd file near line 311).

7.2 Spatial Joins

Great! We’ve wrapped our heads around the concept of an attribute join.

Now let’s extend that concept to its spatially explicit equivalent: the spatial join!


To start, we’ll read in Alameda County schools data.

Then we’ll work with that data and our tracts_acs_sf_ac data together.

# read in school data from a csv file
schools_df <- read.csv(here("notebook_data",
                            "alco_schools.csv"))

# promote to a spatial df
schools_sf <- st_as_sf(schools_df, coords = c('X', 'Y'), crs = 4326)

Let’s check if we have to transform the schools to match thetracts_acs_sf_ac’s CRS.

print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Yes we do! Let’s do that.

NOTE: Explicit syntax aiming at that dataset’s CRS leaves less room for human error!

schools_sf = st_transform(schools_sf, st_crs(tracts_acs_sf_ac))

print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Now we’re ready to combine the datasets in an analysis.

In this case, we want to get data from the census tract within which each school is located.

But how can we do that? The two datasets don’t share a common column to use for a join.

colnames(tracts_acs_sf_ac)
##  [1] "GEOID"        "STATEFP"      "COUNTYFP"     "TRACTCE"      "AFFGEOID"    
##  [6] "NAME.x"       "LSAD"         "ALAND"        "AWATER"       "NAME.y"      
## [11] "c_race"       "c_white"      "c_black"      "c_asian"      "c_latinx"    
## [16] "state_fips"   "county_fips"  "tract_fips"   "med_rent"     "med_hhinc"   
## [21] "c_tenants"    "c_owners"     "c_renters"    "c_movers"     "c_stay"      
## [26] "c_movelocal"  "c_movecounty" "c_movestate"  "c_moveabroad" "c_commute"   
## [31] "c_car"        "c_carpool"    "c_transit"    "c_bike"       "c_walk"      
## [36] "year"         "p_white"      "p_black"      "p_asian"      "p_latinx"    
## [41] "p_owners"     "p_renters"    "p_stay"       "p_movelocal"  "p_movecounty"
## [46] "p_movestate"  "p_moveabroad" "p_car"        "p_carpool"    "p_transit"   
## [51] "p_bike"       "p_walk"       "geometry"
colnames(schools_sf)
## [1] "Site"     "Address"  "City"     "State"    "Type"     "API"      "Org"     
## [8] "geometry"

However, they do have a shared relationship by way of space!

So, we’ll use a spatial relationship query to figure out the census tract that each school is in, then associate the tract’s data with that school (as additional data in the school’s row). This is a spatial join!

Census Tract Data Associated with Each School

In this case, let’s say we’re interested in the relationship between the median household income in a census tract (tracts_acs_sf_ac$med_hhinc) and a school’s Academic Performance Index (schools_gdf$API).

To start, let’s take a look at the distributions of our two variables of interest.

head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
hist(tracts_acs_sf_ac$med_hhinc)

hist(schools_sf$API)

Oh, right! Those pesky schools with no reported APIs (i.e. API == 0)! Let’s drop those.

schools_sf_api = schools_sf[schools_sf$API > 0, ]
hist(schools_sf_api$API)

Much better!

Now, maybe we think there ought to be some correlation between the two variables? As a first pass at this possibility, let’s overlay the two datasets, coloring each one by its variable of interest. This should give us a sense of whether or not similar values co-occur.

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col = 'med_hhinc',
              border.col = "white",
              palette = 'RdYlGn',
              style = "jenks") + 
tm_shape(schools_sf_api) + 
  tm_dots(col = 'API',
          palette = 'RdYlGn',
          border.col = "black",
          style = "jenks",
          size = 0.15)

Spatially Joining our Schools and Census Tracts

Though it’s hard to say for sure, it certainly looks possible. It would be ideal to scatterplot the variables! But in order to do that, we need to know the median household income in each school’s tract, which means we definitely need our spatial join!

We’ll first take a look at the documentation for the spatial join function, st_join.

?st_join

Looks like the key arguments to consider are:

  • the two sf data.frames to be spatially joined (x and y)

  • the type of join to execute (left=), which is a left join if TRUE (default), or an inner join if FALSE

  • the spatial relationship query to use in the join (join=), which by default is st_intersects

NOTES:

  • By default st_join is a left join

  • By default st_join maintains the geometries of the first data.frame input to the operation (i.e. the geometries of x).

When spatially joining two sf dataframes with st_join, the spatial dataframe whose geometry you want to keep should be listed first!

Question

  1. Which sf data.frame are we joining onto which (i.e. which one is getting the other one’s data added to it)?
  2. What happened to ‘outer’ as a join type?
  3. Thus, in our operation, which sf data.frame should be x, which should be y, and should left be TRUE or FALSE?

Alright! Let’s run our join!

schools_join_tracts <- st_join(schools_sf_api, tracts_acs_sf_ac)

# We don't need to specify default arguments!
# schools_join_tracts <- st_join(schools_sf_api, 
#                                tracts_acs_sf_ac, 
#                                left = T, 
#                                join = st_within)

Checking Our Output


Question

As always, we want to sanity-check our intermediate result before we rush ahead.

One way to do that is to introspect the structure of the result object a bit.

  1. What type of object should that have given us?
  2. What should the dimensions of that object be, and why?
  3. If we wanted a visual check of our results (i.e., a plot or map), what could we do?
print(dim(schools_join_tracts))   # the join output
## [1] 325  60
print(dim(schools_sf_api))       # the input schools
## [1] 325   8
print(dim(tracts_acs_sf_ac))     # the input tracts
## [1] 360  53
head(schools_join_tracts)
## Simple feature collection with 6 features and 59 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID  NAME.x LSAD
## 1 06001428302      06      001  428302 1400000US06001428302 4283.02   CT
## 2 06001428302      06      001  428302 1400000US06001428302 4283.02   CT
## 3 06001428500      06      001  428500 1400000US06001428500    4285   CT
## 4 06001427100      06      001  427100 1400000US06001427100    4271   CT
## 5 06001428200      06      001  428200 1400000US06001428200    4282   CT
## 6 06001427900      06      001  427900 1400000US06001427900    4279   CT
##     ALAND AWATER                                           NAME.y c_race
## 1 2234425 474770 Census Tract 4283.02, Alameda County, California   7501
## 2 2234425 474770 Census Tract 4283.02, Alameda County, California   7501
## 3  624197 383522    Census Tract 4285, Alameda County, California   3152
## 4 1025446 168187    Census Tract 4271, Alameda County, California   3768
## 5 1362067 804736    Census Tract 4282, Alameda County, California   6643
## 6  825214      0    Census Tract 4279, Alameda County, California   5031
##   c_white c_black c_asian c_latinx state_fips county_fips tract_fips med_rent
## 1    3114     273    3070      562          6           1     428302     1946
## 2    3114     273    3070      562          6           1     428302     1946
## 3    1293     268     940      378          6           1     428500     1873
## 4    2446      67     574      457          6           1     427100     1767
## 5    3468     387    1589      686          6           1     428200     1855
## 6    2554     228    1407      563          6           1     427900     1712
##   med_hhinc c_tenants c_owners c_renters c_movers c_stay c_movelocal
## 1    147667      2540     2015       525     7436   6705         395
## 2    147667      2540     2015       525     7436   6705         395
## 3     88333      1422      485       937     3125   2641         282
## 4    135833      1420     1040       380     3724   3498         121
## 5    103781      2641     1468      1173     6587   6155         205
## 6    102077      2030      734      1296     4971   4470         108
##   c_movecounty c_movestate c_moveabroad c_commute c_car c_carpool c_transit
## 1           99         175           62      3812  2595       296       409
## 2           99         175           62      3812  2595       296       409
## 3          102         100            0      1514   910        65       374
## 4           86           0           19      1755   986       136       303
## 5           71         141           15      3391  2189       229       510
## 6          244         130           19      3102  1705       380       624
##   c_bike c_walk year   p_white    p_black   p_asian   p_latinx  p_owners
## 1     18     73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 2     18     73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 3     50     18 2018 0.4102157 0.08502538 0.2982234 0.11992386 0.3410689
## 4     33     64 2018 0.6491507 0.01778132 0.1523355 0.12128450 0.7323944
## 5     51    108 2018 0.5220533 0.05825681 0.2391992 0.10326660 0.5558501
## 6     41     52 2018 0.5076526 0.04531902 0.2796661 0.11190618 0.3615764
##   p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.2066929 0.9016945  0.05311996   0.01331361  0.02353416  0.008337816
## 2 0.2066929 0.9016945  0.05311996   0.01331361  0.02353416  0.008337816
## 3 0.6589311 0.8451200  0.09024000   0.03264000  0.03200000  0.000000000
## 4 0.2676056 0.9393126  0.03249194   0.02309345  0.00000000  0.005102041
## 5 0.4441499 0.9344163  0.03112191   0.01077881  0.02140580  0.002277213
## 6 0.6384236 0.8992154  0.02172601   0.04908469  0.02615168  0.003822169
##       p_car  p_carpool p_transit      p_bike     p_walk
## 1 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 2 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 3 0.6010568 0.04293263 0.2470277 0.033025099 0.01188904
## 4 0.5618234 0.07749288 0.1726496 0.018803419 0.03646724
## 5 0.6455323 0.06753170 0.1503981 0.015039811 0.03184901
## 6 0.5496454 0.12250161 0.2011605 0.013217279 0.01676338
##                     geometry
## 1 POINT (-122.2388 37.74476)
## 2   POINT (-122.2519 37.739)
## 3 POINT (-122.2589 37.76206)
## 4 POINT (-122.2348 37.76525)
## 5 POINT (-122.2381 37.75396)
## 6 POINT (-122.2616 37.76911)

Confirmed! The output of the our st_join operation is an sf data.frame (schools_join_tracts) with

  • a row for each school that is located inside a census tract (all of them are)
  • the point geometry of that school
  • all of the attribute data columns (non-geometry columns) from both input sf data.frames

Let’s also take a look at an overlay map of the schools on the tracts. If we color the schools categorically by their tract IDs, then we should see that all schools within a given tract polygon are the same color.

tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col = 'white', border.col = 'black') + 
tm_shape(schools_join_tracts) + 
  tm_dots(col = 'GEOID', size = 0.2)
## Warning: Number of levels of the variable "GEOID" is 208, which is
## larger than max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 208) in the layer function to show all levels.

Assessing the Relationship between Median Household Income and API

Fantastic! That looks right!

Now we can create that scatterplot we were thinking about!

plot(schools_join_tracts$med_hhinc, 
     schools_join_tracts$API,
     xlab = 'median household income ($)',
     ylab = 'API')

Wow! Just as we suspected based on our overlay map, there’s a pretty obvious, strong, and positive correlation between median household income in a school’s tract and the school’s API.

Spatial Join Revisited

Keep it simple

Instead of joining everything in the ACS data to the schools you can just pick one var!

#simple spatial join
schools_join_tracts2 = st_join(schools_sf_api, tracts_acs_sf_ac['med_hhinc'])
head(schools_join_tracts2)
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##   med_hhinc                   geometry
## 1    147667 POINT (-122.2388 37.74476)
## 2    147667   POINT (-122.2519 37.739)
## 3     88333 POINT (-122.2589 37.76206)
## 4    135833 POINT (-122.2348 37.76525)
## 5    103781 POINT (-122.2381 37.75396)
## 6    102077 POINT (-122.2616 37.76911)

7.3: Aggregation

We just saw that a spatial join in one way to leverage the spatial relationship between two datasets in order to create a new, composite dataset.

An aggregation is another way we can generate new data from this relationship. In this case, for each feature in one dataset we find all the features in another dataset that satisfy our chosen spatial relationship query with it (e.g. within, intersects), then aggregate one or more of the joined output variables using some summary function (e.g. count, mean).

Calculating the Mean API Within Each Census Tract

In our spatial join exercise the output suggested that API scores are related to median household income. However, many census tracts have more than one school. So, a mean API score per census tract would be a better way to explore this relationship. Let’s use a spatial data aggregation to calculate that value.

tracts_meanAPI = sf:::aggregate.sf(x = schools_sf_api['API'], 
                                   by = tracts_acs_sf_ac, 
                                   FUN = mean)

head(tracts_meanAPI)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##   API                       geometry
## 1 864 MULTIPOLYGON (((-122.2469 3...
## 2 703 MULTIPOLYGON (((-122.2574 3...
## 3  NA MULTIPOLYGON (((-122.2642 3...
## 4 892 MULTIPOLYGON (((-122.2618 3...
## 5  NA MULTIPOLYGON (((-122.2694 3...
## 6  NA MULTIPOLYGON (((-122.2681 3...

Let’s plot the results

# plot the tracts, coloring them by *mean* API
tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col = 'med_hhinc',
              border.col = 'white',
              palette = 'RdYlGn',
              style = 'jenks',
              title = 'Median HHInc by Tract') + 
  
# now plot the tracts as points, colored by med HHI
tm_shape(tracts_meanAPI) + 
  tm_dots(col = 'API',
          border.col = 'black',
          palette = 'RdYlGn',
          style = 'jenks',
          size = 0.1,
          title = 'Mean API by Tract')

Now let’s join our spatially aggregated API Mean back to the census tract sf dataframe tracts_acs_sf_ac so that we can create a scatter plot of the data.

#simple spatial join of one var
# first rename the col before the join bc alread have an API col
colnames(tracts_meanAPI) <- c('mean_API', 'geometry')
tracts_acs_sf_ac2 <- st_join(tracts_acs_sf_ac, tracts_meanAPI['mean_API'])

plot(tracts_acs_sf_ac2$med_hhinc, 
     tracts_acs_sf_ac2$mean_API,
     xlab = 'median household income ($)',
     ylab = 'Mean API')

Exercise: Spatial Joins and Aggregations

Getting the Aggregated School Counts

In this exercise, you will practice spatial aggregation and spatial joins.

  1. Aggregate the number of schools (schools_sf) within each census tract with the sum function.
  2. Join the count back to the Alameda County census tract data frame tracts_acs_sf_ac2

Note: to make this easier, we will add a dummy variable “count” to the schools data and set it to 1.

schools_sf$school_count = 1  # Add a new variable
head(schools_sf)  # Take a look
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##                     geometry school_count
## 1 POINT (-122.2388 37.74476)            1
## 2   POINT (-122.2519 37.739)            1
## 3 POINT (-122.2589 37.76206)            1
## 4 POINT (-122.2348 37.76525)            1
## 5 POINT (-122.2381 37.75396)            1
## 6 POINT (-122.2616 37.76911)            1

Your code here!

# YOUR CODE HERE

# aggregate the number of schools (`schools_sf`) within each census tract with the `sum` function
# and save to a new sf dataframe called `school_counts_by_tract`
school_counts_by_tract = sf:::aggregate.sf(x = schools_sf['school_count'], 
                                           by = tracts_acs_sf_ac, 
                                           FUN = sum)

# take a look
head(school_counts_by_tract)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##   school_count                       geometry
## 1            1 MULTIPOLYGON (((-122.2469 3...
## 2            1 MULTIPOLYGON (((-122.2574 3...
## 3           NA MULTIPOLYGON (((-122.2642 3...
## 4            2 MULTIPOLYGON (((-122.2618 3...
## 5            1 MULTIPOLYGON (((-122.2694 3...
## 6           NA MULTIPOLYGON (((-122.2681 3...
# join the count back to the Alameda County census tract data frame `tracts_acs_sf_ac2`
tracts_acs_sf_ac2 = st_join(tracts_acs_sf_ac2, 
                            school_counts_by_tract['school_count'])

# take a look at the output
head(tracts_acs_sf_ac2)
## Simple feature collection with 6 features and 54 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2469 ymin: 37.84996 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##            GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD
## 1    06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.8  06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.9  06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.10 06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.11 06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.12 06001400100      06      001  400100 1400000US06001400100   4001   CT
##        ALAND AWATER                                        NAME.y c_race
## 1    6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.8  6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.9  6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.10 6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.11 6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.12 6894340      0 Census Tract 4001, Alameda County, California   3115
##      c_white c_black c_asian c_latinx state_fips county_fips tract_fips
## 1       2055     128     592      104          6           1     400100
## 1.8     2055     128     592      104          6           1     400100
## 1.9     2055     128     592      104          6           1     400100
## 1.10    2055     128     592      104          6           1     400100
## 1.11    2055     128     592      104          6           1     400100
## 1.12    2055     128     592      104          6           1     400100
##      med_rent med_hhinc c_tenants c_owners c_renters c_movers c_stay
## 1        3501    200893      1297     1158       139     3084   2514
## 1.8      3501    200893      1297     1158       139     3084   2514
## 1.9      3501    200893      1297     1158       139     3084   2514
## 1.10     3501    200893      1297     1158       139     3084   2514
## 1.11     3501    200893      1297     1158       139     3084   2514
## 1.12     3501    200893      1297     1158       139     3084   2514
##      c_movelocal c_movecounty c_movestate c_moveabroad c_commute c_car
## 1            202          120         219           29      1542   864
## 1.8          202          120         219           29      1542   864
## 1.9          202          120         219           29      1542   864
## 1.10         202          120         219           29      1542   864
## 1.11         202          120         219           29      1542   864
## 1.12         202          120         219           29      1542   864
##      c_carpool c_transit c_bike c_walk year   p_white    p_black   p_asian
## 1          165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.8        165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.9        165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.10       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.11       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.12       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
##        p_latinx  p_owners p_renters    p_stay p_movelocal p_movecounty
## 1    0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.8  0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.9  0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.10 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.11 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.12 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
##      p_movestate p_moveabroad     p_car p_carpool p_transit p_bike      p_walk
## 1     0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.8   0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.9   0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.10  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.11  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.12  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
##      mean_API school_count                       geometry
## 1         864            1 MULTIPOLYGON (((-122.2469 3...
## 1.8       864            1 MULTIPOLYGON (((-122.2469 3...
## 1.9       864            2 MULTIPOLYGON (((-122.2469 3...
## 1.10      864           NA MULTIPOLYGON (((-122.2469 3...
## 1.11      864           NA MULTIPOLYGON (((-122.2469 3...
## 1.12      864           NA MULTIPOLYGON (((-122.2469 3...
# map the output with plot
plot(tracts_acs_sf_ac2['school_count'])

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 07_Joins_and_Aggregation.Rmd file around line 670).

7.4 Recap

We discussed how we can combine datasets to enhance any geospatial data analyses you could do. Key concepts include:

  • Attribute joins
    • merge()
  • Spatial joins (order matters!)
    • st_join
  • Aggregation
    • aggregate.sf

 D-Lab @ University of California - Berkeley
 Team Geo